文本VQA的开放式问题回答任务通常需要读取和推理图像中很少见或完全看不见的场景文本内容。我们通过提出广义使用外部知识来增强我们对场景文本的理解来解决问题的零射击性质。我们设计一个框架,使用标准的多模式变压器来提取,验证和理性,以了解视觉语言理解任务。通过经验证据和定性结果,我们证明了外部知识如何突出实例的线索,从而有助于应对培训数据偏见,提高答案实体类型的正确性并检测名为“实体”的多字。在类似上游OCR系统和培训数据的限制下,我们生成的结果与三个公开数据集的最新结果相当。
translated by 谷歌翻译
Foveated imaging provides a better tradeoff between situational awareness (field of view) and resolution and is critical in long-wavelength infrared regimes because of the size, weight, power, and cost of thermal sensors. We demonstrate computational foveated imaging by exploiting the ability of a meta-optical frontend to discriminate between different polarization states and a computational backend to reconstruct the captured image/video. The frontend is a three-element optic: the first element which we call the "foveal" element is a metalens that focuses s-polarized light at a distance of $f_1$ without affecting the p-polarized light; the second element which we call the "perifoveal" element is another metalens that focuses p-polarized light at a distance of $f_2$ without affecting the s-polarized light. The third element is a freely rotating polarizer that dynamically changes the mixing ratios between the two polarization states. Both the foveal element (focal length = 150mm; diameter = 75mm), and the perifoveal element (focal length = 25mm; diameter = 25mm) were fabricated as polarization-sensitive, all-silicon, meta surfaces resulting in a large-aperture, 1:6 foveal expansion, thermal imaging capability. A computational backend then utilizes a deep image prior to separate the resultant multiplexed image or video into a foveated image consisting of a high-resolution center and a lower-resolution large field of view context. We build a first-of-its-kind prototype system and demonstrate 12 frames per second real-time, thermal, foveated image, and video capture in the wild.
translated by 谷歌翻译
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. As the transformer layers use residual connections to avoid the problem of vanishing gradients, they can be viewed as the numerical integration of a differential equation. In this extended abstract, we build upon this connection and propose a modification of the internal architecture of a transformer layer. The proposed model places the multi-head attention sublayer and the MLP sublayer parallel to each other. Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks. Moreover, for the image classification task, we show that using neural ODE solvers with a sophisticated integration scheme further improves performance.
translated by 谷歌翻译
This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/
translated by 谷歌翻译
Search and rescue, wildfire monitoring, and flood/hurricane impact assessment are mission-critical services for recent IoT networks. Communication synchronization, dependability, and minimal communication jitter are major simulation and system issues for the time-based physics-based ROS simulator, event-based network-based wireless simulator, and complex dynamics of mobile and heterogeneous IoT devices deployed in actual environments. Simulating a heterogeneous multi-robot system before deployment is difficult due to synchronizing physics (robotics) and network simulators. Due to its master-based architecture, most TCP/IP-based synchronization middlewares use ROS1. A real-time ROS2 architecture with masterless packet discovery synchronizes robotics and wireless network simulations. A velocity-aware Transmission Control Protocol (TCP) technique for ground and aerial robots using Data Distribution Service (DDS) publish-subscribe transport minimizes packet loss, synchronization, transmission, and communication jitters. Gazebo and NS-3 simulate and test. Simulator-agnostic middleware. LOS/NLOS and TCP/UDP protocols tested our ROS2-based synchronization middleware for packet loss probability and average latency. A thorough ablation research replaced NS-3 with EMANE, a real-time wireless network simulator, and masterless ROS2 with master-based ROS1. Finally, we tested network synchronization and jitter using one aerial drone (Duckiedrone) and two ground vehicles (TurtleBot3 Burger) on different terrains in masterless (ROS2) and master-enabled (ROS1) clusters. Our middleware shows that a large-scale IoT infrastructure with a diverse set of stationary and robotic devices can achieve low-latency communications (12% and 11% reduction in simulation and real) while meeting mission-critical application reliability (10% and 15% packet loss reduction) and high-fidelity requirements.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
Recent research has demonstrated the capability of behavior signals captured by smartphones and wearables for longitudinal behavior modeling. However, there is a lack of a comprehensive public dataset that serves as an open testbed for fair comparison among algorithms. Moreover, prior studies mainly evaluate algorithms using data from a single population within a short period, without measuring the cross-dataset generalizability of these algorithms. We present the first multi-year passive sensing datasets, containing over 700 user-years and 497 unique users' data collected from mobile and wearable sensors, together with a wide range of well-being metrics. Our datasets can support multiple cross-dataset evaluations of behavior modeling algorithms' generalizability across different users and years. As a starting point, we provide the benchmark results of 18 algorithms on the task of depression detection. Our results indicate that both prior depression detection algorithms and domain generalization techniques show potential but need further research to achieve adequate cross-dataset generalizability. We envision our multi-year datasets can support the ML community in developing generalizable longitudinal behavior modeling algorithms.
translated by 谷歌翻译
In this research work, we have demonstrated the application of Mask-RCNN (Regional Convolutional Neural Network), a deep-learning algorithm for computer vision and specifically object detection, to semiconductor defect inspection domain. Stochastic defect detection and classification during semiconductor manufacturing has grown to be a challenging task as we continuously shrink circuit pattern dimensions (e.g., for pitches less than 32 nm). Defect inspection and analysis by state-of-the-art optical and e-beam inspection tools is generally driven by some rule-based techniques, which in turn often causes to misclassification and thereby necessitating human expert intervention. In this work, we have revisited and extended our previous deep learning-based defect classification and detection method towards improved defect instance segmentation in SEM images with precise extent of defect as well as generating a mask for each defect category/instance. This also enables to extract and calibrate each segmented mask and quantify the pixels that make up each mask, which in turn enables us to count each categorical defect instances as well as to calculate the surface area in terms of pixels. We are aiming at detecting and segmenting different types of inter-class stochastic defect patterns such as bridge, break, and line collapse as well as to differentiate accurately between intra-class multi-categorical defect bridge scenarios (as thin/single/multi-line/horizontal/non-horizontal) for aggressive pitches as well as thin resists (High NA applications). Our proposed approach demonstrates its effectiveness both quantitatively and qualitatively.
translated by 谷歌翻译
The paper describes the work that has been submitted to the 5th workshop on Challenges and Applications of Automated Extraction of socio-political events from text (CASE 2022). The work is associated with Subtask 1 of Shared Task 3 that aims to detect causality in protest news corpus. The authors used different large language models with customized cross-entropy loss functions that exploit annotation information. The experiments showed that bert-based-uncased with refined cross-entropy outperformed the others, achieving a F1 score of 0.8501 on the Causal News Corpus dataset.
translated by 谷歌翻译
Long-range context modeling is crucial to both dialogue understanding and generation. The most popular method for dialogue context representation is to concatenate the last-$k$ previous utterances. However, this method may not be ideal for conversations containing long-range dependencies. In this work, we propose DialoGX, a novel encoder-decoder based framework for conversational response generation with a generalized and explainable context representation that can look beyond the last-$k$ utterances. Hence the method is adaptive to conversations with long-range dependencies. The main idea of our approach is to identify and utilize the most relevant historical utterances instead of the last-$k$ utterances in chronological order. We study the effectiveness of our proposed method on both dialogue generation (open-domain) and understanding (DST) tasks. DialoGX achieves comparable performance with the state-of-the-art models on DailyDialog dataset. We also observe performance gain in existing DST models with our proposed context representation strategy on MultiWOZ dataset. We justify our context representation through the lens of psycholinguistics and show that the relevance score of previous utterances agrees well with human cognition which makes DialoGX explainable as well.
translated by 谷歌翻译